fix(llm): improve graph JSON parsing robustness for LLM outputs #332

Merged
imbajin merged 4 commits into apache:main from linmengmeng-1314:fix/graph-extract-json-parsing
May 19, 2026
@linmengmeng-1314
Contributor

Summary

  • Improve _extract_and_filter_label to handle varying LLM output formats
  • Strip markdown code blocks before JSON extraction
  • Support both {"vertices":[...], "edges":[...]} (object) and flat array formats
  • Auto-convert flat arrays to the expected object structure

Problem

When using reasoning models (e.g., DeepSeek V4) for graph extraction, the LLM may return:

  1. JSON wrapped in markdown code blocks (```json ... ```), which breaks the greedy regex ({.*})
  2. A flat array [vertex, edge, ...] instead of the expected object {"vertices": [...], "edges": [...]}

Both cases cause json.JSONDecodeError and result in empty extraction output even though the LLM correctly identified entities and relationships.
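
To illustrate the failure, here is a minimal sketch. The reply text and the `re.DOTALL` flag are assumptions made for the example and are not taken from the project code; only the greedy `({.*})` pattern comes from the description above.

```python
import json
import re

# Hypothetical LLM reply: a flat array wrapped in a markdown code fence.
FENCE = "`" * 3
reply = (
    f"{FENCE}json\n"
    '[{"type": "vertex", "label": "person", "properties": {"name": "Tom"}},\n'
    ' {"type": "edge", "label": "knows", "properties": {}}]\n'
    f"{FENCE}"
)

# Greedy object-only pattern: captures from the first "{" to the last "}".
# re.DOTALL is an assumption so that the match can span several lines.
match = re.search(r"({.*})", reply, re.DOTALL)
captured = match.group(1)

try:
    json.loads(captured)
    parsed_ok = True
except json.JSONDecodeError:
    # The capture spans two sibling objects without the enclosing "[ ]",
    # so it is not a single valid JSON document.
    parsed_ok = False

print(parsed_ok)
```

The capture drops the surrounding brackets, so the decoder sees two objects separated by a comma and rejects the input, even though the model's answer was well formed.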

Solution

  • Strip markdown code fences (```json / ```) before regex matching
  • Update regex to match both objects ({...}) and arrays ([...])
  • When a flat array is detected, partition items by type field into vertices and edges
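
The three steps above can be sketched roughly as follows. This is not the code merged in this PR: `parse_graph_json` is a hypothetical helper, and the `"vertex"` / `"edge"` values of the type field are assumptions made for illustration.

```python
import json
import re
from typing import Any


def parse_graph_json(text: str) -> dict[str, list[Any]]:
    """Hypothetical helper: normalize an LLM reply into vertices and edges."""
    empty: dict[str, list[Any]] = {"vertices": [], "edges": []}
    fence = "`" * 3

    # Step 1: strip markdown code fences, with or without a "json" tag.
    cleaned = re.sub(fence + r"(?:json)?", "", text, flags=re.IGNORECASE)

    # Step 2: match either a JSON object or a JSON array.
    match = re.search(r"(\{.*\}|\[.*\])", cleaned, re.DOTALL)
    if match is None:
        return empty
    try:
        data = json.loads(match.group(1))
    except json.JSONDecodeError:
        return empty

    # Step 3: partition a flat array by its "type" field (values assumed).
    if isinstance(data, list):
        items = [item for item in data if isinstance(item, dict)]
        return {
            "vertices": [i for i in items if i.get("type") == "vertex"],
            "edges": [i for i in items if i.get("type") == "edge"],
        }

    # Object format: keep the expected keys, defaulting to empty lists.
    return {
        "vertices": data.get("vertices", []),
        "edges": data.get("edges", []),
    }
```

In this sketch, malformed input yields an empty result rather than an exception, and array items without a recognized type are dropped. Whether the merged code behaves the same way in every edge case is not confirmed here.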

Test plan

  • Test with OpenAI models (existing behavior should be preserved)
  • Test with DeepSeek models (markdown-wrapped array format)
  • Test with Ollama models
  • Verify both object and array formats are handled correctly

🤖 Generated with Claude Code

…tputs

Different LLMs return graph extraction results in varying formats:
- Some wrap JSON in markdown code blocks (```json ... ```)
- Some return a flat array of vertices/edges instead of a structured object

This causes json.JSONDecodeError when the greedy regex ({.*}) captures
invalid content from markdown-wrapped or array-formatted responses.

Changes:
- Strip markdown code blocks before JSON extraction
- Support both object ({...}) and array ([...]) JSON formats
- Auto-convert flat arrays to {"vertices": [...], "edges": [...]} format

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@dosubot (bot) added the size:S (This PR changes 10-29 lines, ignoring generated files) and bug (Something isn't working) labels May 18, 2026
@github-actions github-actions Bot added the llm label May 18, 2026
@imbajin imbajin requested a review from Copilot May 18, 2026 13:31
Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

huaun-develop and others added 3 commits May 19, 2026 09:26
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- cover markdown fenced property graph JSON output
- cover flat array vertex and edge parsing
- keep tests scoped to _extract_and_filter_label behavior
- add coverage for fenced JSON with prose around it
- verify flat arrays drop invalid graph items
- cover malformed fenced JSON returning no graph items
Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@dosubot (bot) added the lgtm (This PR has been approved by a maintainer) label May 19, 2026
@imbajin imbajin changed the title fix(graph): improve property graph JSON parsing robustness for LLM outputs fix(llm): improve graph JSON parsing robustness for LLM outputs May 19, 2026
@imbajin imbajin merged commit 016158f into apache:main May 19, 2026
13 checks passed
@linmengmeng-1314 linmengmeng-1314 review requested due to automatic review settings May 19, 2026 08:06
Labels

  • bug: Something isn't working
  • lgtm: This PR has been approved by a maintainer
  • llm
  • size:S: This PR changes 10-29 lines, ignoring generated files

4 participants